Skip to content

fix(kv): make txn abort idempotent when rollback marker exists#581

Merged
bootjp merged 5 commits intomainfrom
fix/idempotent-rollback-marker
Apr 22, 2026
Merged

fix(kv): make txn abort idempotent when rollback marker exists#581
bootjp merged 5 commits intomainfrom
fix/idempotent-rollback-marker

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 22, 2026

Summary

Fix a 2PC abort-retry race that surfaces in prod as "secondary write failed" ... "write conflict" log spam on !txn|rb| rollback-marker keys, and close the reciprocal commit-after-abort gap flagged in review.

"msg":"secondary write failed","cmd":"EVALSHA",
"err":"<string>:118: key: !txn|rb|!redis|ttl|misskey.bootjp.me:queue:...:stalled-check\^A\ufffd-\^C9\ufffd\^@\^@: write conflict"

Root cause

handleAbortRequest is not idempotent. Once a (primaryKey, startTS) pair has been aborted, the rollback marker !txn|rb|<primaryKey>+<startTS> sits at commitTS = abortTS. A second abort of the same pair — from a concurrent lock resolver, a retry, or a dualwrite async replay — rebuilds:

  • Delete on the already-tombstoned lock / intent keys
  • Put on the already-present rollback marker

Every mutation has latestCommitTS = abortTS > startTS, so MVCC checkConflicts rejects all three as ErrWriteConflict. TestFSMAbort_SecondAbortSameTimestampConflicts was literally pinning this bug's current behaviour.

Fix (per-key idempotency, NOT marker short-circuit)

The ABORT request is not short-circuited on rollback-marker presence. The marker only proves that some prior abort for (primaryKey, startTS) ran; it does not prove that cleanup ran for the specific keys in this request. In particular ShardStore.tryAbortExpiredPrimary issues an ABORT whose mutation list contains only the primary key, so a later lock-resolver abort for a secondary key (same primaryKey, same startTS) would see the marker already present and must still clean up that secondary's lock / intent.

Idempotency is enforced per key:

  • shouldClearAbortKey: if the lock is already gone, skip cleanup for that key (no stale Delete on a tombstoned lock/intent).
  • appendRollbackRecord: if the rollback marker is already present for (primaryKey, startTS), skip the Put (no byte-identical re-Put that MVCC would reject as a write conflict).

Both cost one extra GetAt on the hot abort path; the common case (fresh abort) is unaffected.

Symmetric commit / rollback invariant

Desired invariant: for any (primaryKey, startTS), at most one of {rollback marker, commit record} is ever present. The invariant is enforced by symmetric guards rather than assumed:

  • appendRollbackRecord: verifies the commit record is absent on both the first-time and idempotent paths (not just the first-time path), returning ErrTxnAlreadyCommitted if a commit record already exists.
  • commitApplyStartTS: verifies the rollback marker is absent before writing a commit record, returning a new ErrTxnAlreadyAborted if one exists. This catches out-of-order apply (COMMIT after ABORT), buggy clients, replay races — the case addressed directly by the review.

Test plan

  • go test -race -count=1 ./kv/... green
  • make lint clean
  • TestFSMAbort_SecondAbortSameKeysIsIdempotent — same-abortTS and later-abortTS retry (HLC-monotonic lock-resolver race path)
  • TestFSMAbort_SecondAbortDifferentKeysCleansRemainder — primary-only first abort followed by secondary-key abort must still clean the secondary (regression for a would-be over-broad marker short-circuit)
  • TestFSMAbort_RejectsAlreadyCommittedTxn — abort-after-commit rejected with ErrTxnAlreadyCommitted
  • TestFSMAbort_CommitAfterAbortIsRejected — commit-after-abort rejected with ErrTxnAlreadyAborted (new symmetric guard)
  • Deploy; "secondary write failed" log rate on !txn|rb| should drop to zero

Relates to the BullMQ stalled-check traffic class across the relationship, deliver, objectStorage, and webhook queues.

Summary by CodeRabbit

  • Bug Fixes

    • Commits for transactions that were previously aborted are now rejected.
    • Abort operations are idempotent — retries (even at later timestamps) succeed without error.
    • Distributed lock cleanup across primary/secondary keys is more reliable; no orphaned locks remain.
    • No unnecessary lock deletions occur when locks are already absent.
  • Tests

    • Added coverage validating abort idempotency, lock-resolver races, and commit-after-abort rejection.

Production log spam:
  "secondary write failed" ... "write conflict"
  key: !txn|rb|!redis|ttl|<BullMQ stalled-check key>+<startTS>

Root cause: the 2PC abort path is not idempotent. Once an abort has
run to completion, the rollback marker !txn|rb|<primaryKey>+<startTS>
is present at commitTS = abortTS. A second abort of the same
(primaryKey, startTS) pair — from a concurrent lock-resolver race, a
retry, or a dualwrite async replay — rebuilds the same Delete
mutations on the already-tombstoned lock/intent keys and a duplicate
Put on the rollback marker. Every one of those has a latestCommitTS
= abortTS > startTS so MVCC checkConflicts returns
ErrWriteConflict.

The rollback marker's contract is "this txn was aborted". Its
payload is a deterministic single byte (txnRollbackVersion), so
multiple identical writes carry no semantic difference. The work
the retry tries to do has already been done atomically in the first
apply (ApplyMutations is a single pebble batch), so skipping the
retry is equivalent to a second-writer-wins + idempotent apply, at
no cost.

Fix: probe txnRollbackKey at the top of handleAbortRequest. When
it's already present return nil without enqueuing any mutations.
Cheap GetAt on the hot abort path; the common case (fresh abort,
marker absent) pays one extra point lookup which the pebble block
cache will serve hot.

Safety argument: the rollback marker appears in the store only via
ApplyMutations, which writes it atomically together with the
lock/intent deletes. If the marker is visible at readTS ∞, the
cleanup was visible too. There is no partial-abort state where the
marker exists but the locks remain.

Test: TestFSMAbort_SecondAbortIsIdempotent (renamed from the prior
TestFSMAbort_SecondAbortSameTimestampConflicts, whose assertion was
exactly the bug this patch fixes). Pins both same-abortTS retry and
later-abortTS retry (HLC-monotonic, the prod resolver-race path).
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c26df192-89aa-4765-96b5-55fed9f0e03e

📥 Commits

Reviewing files that changed from the base of the PR and between 9799399 and 96ec25c.

📒 Files selected for processing (2)
  • kv/fsm.go
  • kv/fsm_abort_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • kv/fsm.go

📝 Walkthrough

Walkthrough

Adds idempotent abort handling and stricter commit/rollback invariants: aborts now use existence checks for rollback markers, commits check for rollback markers and reject with ErrTxnAlreadyAborted, lock-cleanup behavior for missing locks changed, and tests plus a new sentinel error were added. (<=50 words)

Changes

Cohort / File(s) Summary
Core FSM logic
kv/fsm.go
commitApplyStartTS now uses store.ExistsAt to detect rollback markers and returns ErrTxnAlreadyAborted if present; appendRollbackRecord enforces commit-vs-rollback invariant via ExistsAt, making rollback writes idempotent and returning ErrTxnAlreadyCommitted if a commit exists; shouldClearAbortKey treats store.ErrKeyNotFound as “do not emit cleanup”; added/clarifying comment in handleAbortRequest. Attention: error propagation and ExistsAt semantics changed.
Abort behavior tests
kv/fsm_abort_test.go
Reworked tests to assert abort idempotency and partial-key cleanup: renamed test expecting idempotent second abort; added retries at later abortTS; added tests for primary-only then per-secondary aborts, a lock-resolver race scenario ensuring no orphan locks, and a test verifying commits after rollback marker are rejected with ErrTxnAlreadyAborted.
Transaction errors
kv/txn_errors.go
Added exported sentinel error ErrTxnAlreadyAborted = errors.New("txn already aborted").

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I dug through keys at break of dawn,
Found rollback marks where conflicts had drawn,
Aborts now hop back, idempotent and neat,
Commits get turned away — no double-beat,
I twitch my nose: no orphaned locks to fawn.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically summarizes the main change: making transaction aborts idempotent when a rollback marker exists, which directly addresses the root cause of the 2PC abort-retry race.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/idempotent-rollback-marker

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kv/fsm.go (1)

528-564: ⚠️ Potential issue | 🔴 Critical

Don’t let the rollback marker suppress outstanding secondary cleanup.

Line 541 returns nil solely because txnRollbackKey(primaryKey, startTS) exists, but that marker is written when the current abort batch includes the primary key; it does not prove every secondary key in later abort batches was cleaned. A primary-only abort can create the marker, then a later secondary-key abort will be skipped here and leave that secondary lock/intent behind.

Please only short-circuit after verifying the requested keys are already resolved, or continue cleaning matching outstanding locks while suppressing only the duplicate rollback-marker write. Add a regression like: prepare {primary, secondary} → abort {primary} → abort {secondary} → assert txnLockKey(secondary) and txnIntentKey(secondary) are gone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/fsm.go` around lines 528 - 564, The current early return that checks
f.store.GetAt(txnRollbackKey(...)) must be removed because the rollback marker
only proves the primary was handled, not that all secondary keys are cleaned;
instead, keep processing mutations (uniq, buildAbortCleanupStoreMutations,
ApplyMutations) but suppress only the duplicate rollback-marker Put: after
calling buildAbortCleanupStoreMutations (or inside appendRollbackRecord), detect
if the rollback marker already exists and if so do not add the txnRollbackKey
Put (or strip it out of storeMuts) while still applying remaining storeMuts and
returning their result; also add the regression test you described (prepare
{primary, secondary} → abort {primary} → abort {secondary} → assert
txnLockKey(secondary)/txnIntentKey(secondary) are removed). Ensure references:
txnRollbackKey, f.store.GetAt, uniqueMutations, buildAbortCleanupStoreMutations,
appendRollbackRecord, ApplyMutations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@kv/fsm.go`:
- Around line 528-564: The current early return that checks
f.store.GetAt(txnRollbackKey(...)) must be removed because the rollback marker
only proves the primary was handled, not that all secondary keys are cleaned;
instead, keep processing mutations (uniq, buildAbortCleanupStoreMutations,
ApplyMutations) but suppress only the duplicate rollback-marker Put: after
calling buildAbortCleanupStoreMutations (or inside appendRollbackRecord), detect
if the rollback marker already exists and if so do not add the txnRollbackKey
Put (or strip it out of storeMuts) while still applying remaining storeMuts and
returning their result; also add the regression test you described (prepare
{primary, secondary} → abort {primary} → abort {secondary} → assert
txnLockKey(secondary)/txnIntentKey(secondary) are removed). Ensure references:
txnRollbackKey, f.store.GetAt, uniqueMutations, buildAbortCleanupStoreMutations,
appendRollbackRecord, ApplyMutations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b14aecdd-e92c-4157-aecf-7890bf4f58ba

📥 Commits

Reviewing files that changed from the base of the PR and between 76ea3c4 and 6457c50.

📒 Files selected for processing (2)
  • kv/fsm.go
  • kv/fsm_abort_test.go

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make transaction ABORT handling idempotent in the KV FSM to prevent abort-retry races from producing MVCC write conflicts (notably on !txn|rb|... rollback-marker keys) and related log spam in production.

Changes:

  • Add an early rollback-marker existence probe in handleAbortRequest to short-circuit repeat aborts.
  • Update/rename the abort FSM test to assert abort retries (same and later abortTS) are no-ops.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
kv/fsm.go Adds rollback-marker existence check to treat repeated aborts as idempotent.
kv/fsm_abort_test.go Replaces the prior “second abort conflicts” test with an idempotency test for abort retries.

Comment thread kv/fsm.go Outdated
Comment on lines +528 to +545
// Idempotency short-circuit: if the rollback marker for this
// (primaryKey, startTS) already exists, a previous abort already
// completed the whole cleanup atomically (ApplyMutations writes the
// rollback marker together with the lock/intent deletes in one
// batch, so the marker's presence proves cleanup ran). Without this
// guard a retry or a concurrent second lock-resolver would re-emit
// Delete mutations on already-tombstoned lock/intent keys and a
// duplicate rollback-marker Put — all three would be rejected by
// the MVCC store as write conflicts (latestCommitTS > startTS) and
// surface in prod as "secondary write failed" log spam without
// changing any state. Rollback markers are deterministic
// ({txnRollbackVersion}) so second-writer-wins would be equivalent
// anyway; skipping the work is simpler and cheaper.
if _, err := f.store.GetAt(ctx, txnRollbackKey(meta.PrimaryKey, startTS), ^uint64(0)); err == nil {
return nil
} else if !errors.Is(err, store.ErrKeyNotFound) {
return errors.WithStack(err)
}
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollback-marker presence does not prove that all lock/intent cleanup for the current abort request’s keys has already run. In particular, ShardStore.tryAbortExpiredPrimary issues an ABORT with only the primary key (kv/shard_store.go:1093), which writes the rollback marker on the primary shard; subsequent per-key aborts for other keys on that same shard (e.g., resolveTxnLockForKey / lock resolver) must still delete those keys’ lock/intent entries. With this unconditional early-return, any later ABORT request targeting a non-primary key that hashes to the primary shard will no-op once the marker exists, potentially leaving orphaned locks/intents indefinitely.

Suggested fix: only short-circuit when the abort request is known to be a pure retry of a prior same-keyset abort. A minimal safe change is to gate this check on whether the request includes the primary key (abortingPrimary), or otherwise avoid returning early and instead skip only the rollback-marker Put while still running cleanup for the requested keys.

Copilot uses AI. Check for mistakes.
…n lock absent

The previous fix short-circuited the whole abort request on rollback-marker
presence, which can leave secondaries orphaned: ShardStore.tryAbortExpiredPrimary
issues an ABORT with only the primary key, writing the rollback marker; a later
lock-resolver abort for a secondary (same primaryKey, same startTS) would then
see the marker and skip that secondary lock/intent cleanup.

Restructure idempotency to be per-key:
- Remove the broad short-circuit in handleAbortRequest.
- shouldClearAbortKey now returns false when the lock is missing (lock and
  intent are always written/deleted together, so lock-missing iff intent-missing;
  re-emitting Deletes on tombstoned keys would just trigger MVCC conflicts).
- appendRollbackRecord skips the Put if the marker is already present
  (idempotent; skip commit-wins check on this path since commit must be
  absent when the marker exists).

Add regression tests: SecondAbortDifferentKeysCleansRemainder reproduces the
orphan scenario (fails without the fix), LockResolverRaceLeavesNoOrphan
simulates the full prod flow, SameKeysIsIdempotent preserves the original
retry-safety invariant.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements idempotency for transaction abort requests within the KV FSM. Key changes include checking for existing rollback markers before writing new ones and skipping cleanup mutations if locks are already absent, which prevents MVCC write conflicts during racing abort operations. New tests verify these idempotent behaviors and ensure secondary keys are correctly cleaned up even when a rollback marker is present. A review comment highlights a potential consistency issue where the commit path might need a reciprocal check for rollback markers to prevent a transaction from having both commit and rollback records.

Comment thread kv/fsm.go Outdated
Comment on lines +643 to +647
// Safety: if the rollback marker is present, the commit record
// must be absent for this (primaryKey, startTS) because the
// commit-wins-over-rollback check below ran when the marker was
// first written, so we skip the commit check on the idempotent
// path.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The safety claim that the commit record must be absent if the rollback marker is present relies on handleCommitRequest also performing a reciprocal check for the rollback marker. Currently, handleCommitRequest (and its helper commitApplyStartTS) only checks for the existence of a commit record.

If a COMMIT request is applied after an ABORT has already written the rollback marker and cleared the locks, handleCommitRequest will find no locks, return no mutations for the user keys, but still proceed to write a txnCommitKey. This would result in an inconsistent state where both a commit record and a rollback record exist for the same transaction. Consider adding a check for the rollback marker in the commit path to ensure that an aborted transaction cannot be partially 'committed' later.

References
  1. When performing atomic operations based on a version or state variable, do not release the lock between reading the variable and performing the conditional check to prevent race conditions and the use of stale data.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread kv/fsm.go
Comment on lines +528 to +538
// NOTE: do NOT short-circuit the whole request on rollback-marker
// presence. The marker only proves that SOME prior abort for this
// (primaryKey, startTS) ran; it does not prove cleanup ran for the
// specific keys in *this* request. In particular
// ShardStore.tryAbortExpiredPrimary issues an ABORT whose mutation
// list contains only the primary key, so a later lock-resolver
// abort for a secondary key (same primaryKey, same startTS) would
// see the marker already present and must still clean up that
// secondary's lock/intent. Idempotency is enforced per-key in
// shouldClearAbortKey (lock-missing ⇒ nothing to do) and for the
// rollback-marker Put in appendRollbackRecord.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says handleAbortRequest will probe the rollback marker and return early if it exists, but the implementation explicitly avoids that (and relies on per-key idempotency instead). Please update the PR description to match the actual behavior (or adjust the code if the early-return approach is still intended).

Copilot uses AI. Check for mistakes.
Comment thread kv/fsm.go Outdated
Comment on lines +643 to +650
// Safety: if the rollback marker is present, the commit record
// must be absent for this (primaryKey, startTS) because the
// commit-wins-over-rollback check below ran when the marker was
// first written, so we skip the commit check on the idempotent
// path.
if _, err := f.store.GetAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0)); err == nil {
return nil
} else if !errors.Is(err, store.ErrKeyNotFound) {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appendRollbackRecord returns nil as soon as a rollback marker exists, skipping the commit-record check. The comment claims the commit record “must be absent”, but that only holds at the moment the marker was first written; a (buggy or replayed) COMMIT could still write txnCommitKey afterward. Consider still checking txnCommitKey(primaryKey,startTS) even on the idempotent path and returning ErrTxnAlreadyCommitted if present (and/or adjust the safety comment to avoid asserting an invariant that isn’t enforced here).

Copilot uses AI. Check for mistakes.
Comment thread kv/fsm_abort_test.go Outdated
Comment on lines +297 to +303
// same mutation set must return nil without touching the store.
// Idempotency is enforced per-key in shouldClearAbortKey (lock
// already gone ⇒ skip) and by appendRollbackRecord (marker already
// present ⇒ skip). The prior behaviour (write-conflict on the
// rollback-marker Put) surfaced in prod as "secondary write failed"
// log spam whenever dualwrite replay or the lock resolver raced a
// completed abort.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment says the second abort “must return nil without touching the store”, but the idempotent path still performs reads (e.g., GetAt in shouldClearAbortKey / appendRollbackRecord). If the intent is “no additional writes/mutations”, consider rewording to avoid implying a read-free fast path.

Suggested change
// same mutation set must return nil without touching the store.
// Idempotency is enforced per-key in shouldClearAbortKey (lock
// already gone ⇒ skip) and by appendRollbackRecord (marker already
// present ⇒ skip). The prior behaviour (write-conflict on the
// rollback-marker Put) surfaced in prod as "secondary write failed"
// log spam whenever dualwrite replay or the lock resolver raced a
// completed abort.
// same mutation set must return nil without performing additional
// writes or other store mutations.
// Idempotency is enforced per-key in shouldClearAbortKey (lock
// already gone ⇒ skip cleanup) and by appendRollbackRecord (marker
// already present ⇒ skip the write). The retry path may still read
// store state to determine that those mutations can be skipped.
// The prior behaviour (write-conflict on the rollback-marker Put)
// surfaced in prod as "secondary write failed" log spam whenever
// dualwrite replay or the lock resolver raced a completed abort.

Copilot uses AI. Check for mistakes.
Enforce the desired invariant symmetrically: for any (primaryKey,
startTS), at most one of {rollback marker, commit record} is present.

- appendRollbackRecord now re-checks the commit record on the
  idempotent (marker-already-present) path — previously that check only
  ran the first time, so a COMMIT reordered after an ABORT could escape
  detection.
- commitApplyStartTS now probes the rollback marker before writing a
  commit record and returns the new ErrTxnAlreadyAborted if one is
  present — the reciprocal guard that was missing on the commit side.
- Tighten the appendRollbackRecord comment: "commit record absent" is a
  desired invariant verified in-line, not an assumption.
- Add TestFSMAbort_CommitAfterAbortIsRejected; reword the
  SecondAbortSameKeysIsIdempotent comment since the idempotent path
  still performs GetAt reads.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

Addressed gemini HIGH + 3 Copilot items. Pushed a1d8588.

gemini HIGH — commit/rollback consistency gap

  1. appendRollbackRecord now verifies the commit record is absent on both the first-time and idempotent paths (previously only on first-time). If the invariant is ever violated (e.g. buggy reordering wrote a commit after abort), abort returns ErrTxnAlreadyCommitted.
  2. commitApplyStartTS now probes txnRollbackKey(primaryKey, startTS) before writing a commit record. If the marker is present, commit is rejected with a new ErrTxnAlreadyAborted. This is the reciprocal guard that was missing — the case where a COMMIT arrives after ABORT (replay, buggy client, race) no longer silently creates the inconsistent (marker + commit both present) state.

Together these close the gap: the invariant "at most one of {rollback marker, commit record} present" is now enforced on both sides, not assumed.

Copilot #3 — PR body rewritten to accurately describe per-key idempotency (shouldClearAbortKey + appendRollbackRecord marker-skip) rather than the incorrect "probe marker and return early" description.

Copilot #4 — comment in appendRollbackRecord now states the invariant is desired and verified in-line, not assumed.

Copilot #5 — test comment at fsm_abort_test.go:303 reworded: "without performing additional writes or store mutations (reads are allowed — the idempotent path still probes for the rollback marker and commit record via GetAt)".

New test: TestFSMAbort_CommitAfterAbortIsRejected pins the symmetric guard. The flip direction is already covered by TestFSMAbort_RejectsAlreadyCommittedTxn.

Error choice: ErrTxnAlreadyAborted = errors.New("txn already aborted"), following the naming convention of the existing ErrTxnAlreadyCommitted.

Verification:

  • go test -race -count=1 ./kv/... — PASS (3.846s)
  • make lint — 0 issues

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements idempotency for transaction aborts and enforces the invariant that a transaction cannot be both committed and aborted. Key changes include adding checks for rollback markers during the commit process, verifying commit records during rollback, and ensuring that missing locks during an abort do not trigger redundant deletions. The review feedback focuses on optimizing these existence checks by using ExistsAt instead of GetAt in commitApplyStartTS and appendRollbackRecord to avoid unnecessary value cloning and simplify error handling.

Comment thread kv/fsm.go Outdated
Comment on lines +475 to +479
if _, rerr := f.store.GetAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0)); rerr == nil {
return 0, errors.WithStack(ErrTxnAlreadyAborted)
} else if !errors.Is(rerr, store.ErrKeyNotFound) {
return 0, errors.WithStack(rerr)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When checking for the existence of a rollback marker or commit record without needing the actual value, prefer using ExistsAt over GetAt. ExistsAt is more efficient as it avoids cloning the value from the underlying MVCC store. Additionally, ExistsAt returns (false, nil) if the key is not found, which simplifies the error handling by removing the need for explicit ErrKeyNotFound checks.

Suggested change
if _, rerr := f.store.GetAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0)); rerr == nil {
return 0, errors.WithStack(ErrTxnAlreadyAborted)
} else if !errors.Is(rerr, store.ErrKeyNotFound) {
return 0, errors.WithStack(rerr)
}
if exists, rerr := f.store.ExistsAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0)); rerr != nil {
return 0, errors.WithStack(rerr)
} else if exists {
return 0, errors.WithStack(ErrTxnAlreadyAborted)
}

Comment thread kv/fsm.go Outdated
Comment on lines +665 to +669
if _, err := f.store.GetAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0)); err == nil {
markerPresent = true
} else if !errors.Is(err, store.ErrKeyNotFound) {
return errors.WithStack(err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the suggestion in commitApplyStartTS, use ExistsAt here to check for the presence of the rollback marker. This avoids unnecessary memory allocation and simplifies the logic.

markerPresent, err := f.store.ExistsAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0))
	if err != nil {
		return errors.WithStack(err)
	}

Comment thread kv/fsm.go Outdated
Comment on lines 676 to 680
if _, err := f.store.GetAt(ctx, txnCommitKey(primaryKey, startTS), ^uint64(0)); err == nil {
return errors.WithStack(ErrTxnAlreadyCommitted)
} else if err != nil && !errors.Is(err, store.ErrKeyNotFound) {
return errors.WithStack(err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using ExistsAt here instead of GetAt will improve performance on the abort path by avoiding a value clone and will make the code more concise by removing the ErrKeyNotFound check.

Suggested change
if _, err := f.store.GetAt(ctx, txnCommitKey(primaryKey, startTS), ^uint64(0)); err == nil {
return errors.WithStack(ErrTxnAlreadyCommitted)
} else if err != nil && !errors.Is(err, store.ErrKeyNotFound) {
return errors.WithStack(err)
}
committed, err := f.store.ExistsAt(ctx, txnCommitKey(primaryKey, startTS), ^uint64(0))
if err != nil {
return errors.WithStack(err)
} else if committed {
return errors.WithStack(ErrTxnAlreadyCommitted)
}

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

/gemini review

Addressed all 3 medium items: replaced GetAt + errors.Is(err, ErrKeyNotFound) presence checks with store.ExistsAt in kv/fsm.go:

  1. commitApplyStartTS (~L475): rollback-marker presence check
  2. appendRollbackRecord idempotent path (~L664): rollback-marker presence check
  3. appendRollbackRecord commit-check (~L676): commit-record presence check

Semantics identical. go test -race -count=1 ./kv/... green; make lint clean.

Commit: 9799399

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kv/fsm.go (1)

467-493: ⚠️ Potential issue | 🟡 Minor

Check the rollback marker even on idempotent COMMITs.

Line 471 gates the rollback-marker check behind !committed, so a state that already has both txnCommitKey(primaryKey,startTS) and txnRollbackKey(primaryKey,startTS) will still accept a matching COMMIT. That contradicts the invariant documented on Lines 461-465.

Proposed fix
 	recordedCommitTS, committed, err := f.txnCommitTS(ctx, primaryKey, startTS)
 	if err != nil {
 		return 0, err
 	}
+	exists, rerr := f.store.ExistsAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0))
+	if rerr != nil {
+		return 0, errors.WithStack(rerr)
+	}
+	if exists {
+		return 0, errors.WithStack(ErrTxnAlreadyAborted)
+	}
 	if !committed {
-		// No commit record yet: reject if a rollback marker is present.
-		// This catches out-of-order apply (COMMIT after ABORT), buggy
-		// clients, and replay races.
-		exists, rerr := f.store.ExistsAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0))
-		if rerr != nil {
-			return 0, errors.WithStack(rerr)
-		}
-		if exists {
-			return 0, errors.WithStack(ErrTxnAlreadyAborted)
-		}
 		return startTS, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/fsm.go` around lines 467 - 493, The rollback-marker check must run even
for idempotent COMMITs: currently the ExistsAt(txnRollbackKey(primaryKey,
startTS)) call is only executed when !committed, letting a state that has both
txnCommitKey and txnRollbackKey silently accept a matching COMMIT. Change the
logic in the block that handles txnCommitTS (the call to f.txnCommitTS and
subsequent handling) so you always probe
f.store.ExistsAt(txnRollbackKey(primaryKey, startTS), ^uint64(0)) and return
ErrTxnAlreadyAborted if it exists before accepting/returning a commit TS; keep
the recordedCommitTS vs commitTS validation and the existing return values (use
commitTS when OK).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@kv/fsm.go`:
- Around line 467-493: The rollback-marker check must run even for idempotent
COMMITs: currently the ExistsAt(txnRollbackKey(primaryKey, startTS)) call is
only executed when !committed, letting a state that has both txnCommitKey and
txnRollbackKey silently accept a matching COMMIT. Change the logic in the block
that handles txnCommitTS (the call to f.txnCommitTS and subsequent handling) so
you always probe f.store.ExistsAt(txnRollbackKey(primaryKey, startTS),
^uint64(0)) and return ErrTxnAlreadyAborted if it exists before
accepting/returning a commit TS; keep the recordedCommitTS vs commitTS
validation and the existing return values (use commitTS when OK).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8c2661f2-0c1b-4ab8-ab0d-5483f33adbc0

📥 Commits

Reviewing files that changed from the base of the PR and between 6457c50 and 9799399.

📒 Files selected for processing (3)
  • kv/fsm.go
  • kv/fsm_abort_test.go
  • kv/txn_errors.go
✅ Files skipped from review due to trivial changes (1)
  • kv/txn_errors.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • kv/fsm_abort_test.go

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances transaction handling by enforcing invariants between commit and rollback records and ensuring the idempotency of abort operations. Key changes include rejecting commits when a rollback marker is present, preventing rollbacks after a transaction has committed, and optimizing lock cleanup to avoid redundant MVCC writes. Feedback was provided regarding the necessity of explicit existence checks in appendRollbackRecord, suggesting they might be redundant if the leader already ensures timestamp consistency.

Comment thread kv/fsm.go
Comment on lines +666 to +682
markerPresent, err := f.store.ExistsAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0))
if err != nil {
return errors.WithStack(err)
}

// Verify the invariant regardless of marker presence: if a commit
// record is present for this (primaryKey, startTS), refuse to
// write (or confirm) a rollback marker. This catches out-of-order
// apply where a COMMIT somehow landed after a prior ABORT, as
// well as the normal "commit wins over rollback" race.
commitExists, err := f.store.ExistsAt(ctx, txnCommitKey(primaryKey, startTS), ^uint64(0))
if err != nil {
return errors.WithStack(err)
}
if commitExists {
return errors.WithStack(ErrTxnAlreadyCommitted)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ExistsAt calls in appendRollbackRecord appear to be performing Optimistic Concurrency Control (OCC) conflict checks. According to repository rules, these checks are not required at the store layer because the leader is responsible for issuing consistent timestamps. Consider removing these checks to simplify the logic and improve performance.

References
  1. Optimistic Concurrency Control (OCC) conflict checks are not required at the store layer because the leader is responsible for issuing consistent timestamps.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

Thanks for the review, but I believe this one misreads what appendRollbackRecord is doing. The two ExistsAt calls here are not store-layer OCC conflict checks — they are application-layer invariant enforcement, and removing them would reintroduce exactly the correctness bug this PR was opened to fix.

The invariant (documented in the comment at kv/fsm.go:652-665 and the symmetric comment on commitApplyStartTS at kv/fsm.go:457-465) is:

For any (primaryKey, startTS) pair, at most one of {txnRollbackKey, txnCommitKey} must exist.

Each ExistsAt call has a specific, test-pinned purpose:

  1. ExistsAt(txnCommitKey(pk, startTS), MaxTS)ErrTxnAlreadyCommitted — rejects an abort that arrives for a txn whose COMMIT record already landed (commit-wins-over-abort). Pinned by TestFSMAbort_RejectsAlreadyCommittedTxn (kv/fsm_abort_test.go:112-143). Remove this and that test fails and the invariant breaks.

  2. ExistsAt(txnRollbackKey(pk, startTS), MaxTS) → skip Put — idempotency for the marker. Rollback markers are deterministic (written at txnRollbackVersion), and a second Put against the already-tombstoned key would be rejected by the MVCC store as a write conflict (latestCommitTS > startTS). This is what makes lock-resolver races safe. Pinned by TestFSMAbort_LockResolverRaceLeavesNoOrphan (kv/fsm_abort_test.go:433-482), which fires tryAbortExpiredPrimary plus N per-secondary lock-resolver aborts plus a redundant replay for the same (pk, startTS) and asserts no orphan lock/intent remains.

  3. The symmetric guard in commitApplyStartTS (kv/fsm.go:466-482) rejects COMMIT when a rollback marker is present (ErrTxnAlreadyAborted). Pinned by TestFSMAbort_CommitAfterAbortIsRejected (kv/fsm_abort_test.go:514-554). This is the other half of the same invariant and is exactly the reciprocal check the earlier Copilot [high] review asked for.

The OCC rule from AGENTS.md ("timestamp issuance should originate from the Raft leader to prevent skewed OCC decisions") is about mutation-path write-conflict detection — i.e. whether a user write at commitTS collides with a newer committed version. That is not what these probes do. These probes read txn-control records at MaxTS to verify a coexistence invariant before synthesizing store mutations; they do not influence MVCC conflict resolution and do not depend on any timestamp the store layer issues locally.

Keeping the checks. No code change.

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements idempotency for transaction abort operations and enforces the invariant that a transaction cannot be both committed and aborted. It introduces checks for rollback markers during the commit process and commit records during the rollback process. Additionally, shouldClearAbortKey was updated to return false when a lock is missing, preventing unnecessary mutations that cause MVCC write conflicts. Feedback was provided regarding the atomicity of the state checks in appendRollbackRecord to ensure consistency and prevent potential race conditions.

Comment thread kv/fsm.go
Comment on lines +666 to +682
markerPresent, err := f.store.ExistsAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0))
if err != nil {
return errors.WithStack(err)
}

// Verify the invariant regardless of marker presence: if a commit
// record is present for this (primaryKey, startTS), refuse to
// write (or confirm) a rollback marker. This catches out-of-order
// apply where a COMMIT somehow landed after a prior ABORT, as
// well as the normal "commit wins over rollback" race.
commitExists, err := f.store.ExistsAt(ctx, txnCommitKey(primaryKey, startTS), ^uint64(0))
if err != nil {
return errors.WithStack(err)
}
if commitExists {
return errors.WithStack(ErrTxnAlreadyCommitted)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In appendRollbackRecord, the code performs two ExistsAt calls to enforce the invariant that a transaction cannot be both committed and aborted. To maintain consistency, ensure that the lock protecting these state variables is not released between the reads and the conditional check. Additionally, both ExistsAt calls must share the same start timestamp, and this timestamp should be used as the transaction's start timestamp for this read-modify-write operation to ensure a consistent snapshot view.

References
  1. In an MVCC system, all operations within a single transaction must share the same start timestamp to ensure atomicity and a consistent snapshot view.
  2. In transactional read-modify-write operations, set the transaction's start timestamp to the read timestamp to ensure a consistent snapshot and prevent race conditions.
  3. When performing atomic operations based on a version or state variable, do not release the lock between reading the variable and performing the conditional check to prevent race conditions and the use of stale data.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

Responding to the HIGH review on kv/fsm.go:682 ("two ExistsAt calls need a lock / shared snapshot timestamp / startTS as readTS").

I think this review misreads the concurrency model. No code change is warranted.

Serialization argument: appendRollbackRecord is only reachable via kvFSM.Apply -> applyRequestErr -> handleRequest -> handleTxnRequest -> handleAbortRequest. hashicorp/raft drives FSM.Apply from a single goroutine (runFSM in raft@v1.7.3/fsm.go:86, consuming fsmMutateCh). The raft.FSM interface docs state explicitly: "Apply and Snapshot are always called from the same thread" (raft@v1.7.3/fsm.go:34). Therefore, between the ExistsAt on txnRollbackKey (fsm.go:666) and the ExistsAt on txnCommitKey (fsm.go:676), no other Apply invocation can run against this FSM, and thus no other writer can insert/remove either record. An additional user-space lock adds nothing; there is no state variable to protect beyond what the MVCC store already commits durably.

The only concurrent raft-side callback is FSMSnapshot.Persist, which is read-only against a captured snapshot handle (see kv/snapshot.go) and cannot mutate txnRollbackKey/txnCommitKey.

Read timestamp choice: Both reads deliberately use MaxTS (^uint64(0)) because we want to observe the latest committed state of the store at apply time, not a snapshot at the transaction's startTS. Using startTS as the read timestamp would be a bug: the commit record for this txn is written with commitTS > startTS (asserted in TestFSMAbort_AbortTSMustBeGreaterThanStartTS), so a read at startTS would never see it, defeating the whole "commit wins over rollback" invariant check at fsm.go:680. Since Apply is serialized, the two MaxTS reads are on a logically identical store snapshot - no read-skew is possible.

Tests that would catch a regression: TestFSMAbort_CommitAfterAbortIsRejected, TestFSMAbort_RejectsAlreadyCommittedTxn, TestFSMAbort_SecondAbortSameKeysIsIdempotent, TestFSMAbort_RollbackRecord in kv/fsm_abort_test.go.

If I've missed a concurrency path (a non-Apply writer of these keys, a second FSM instance sharing the store, a background rewriter, etc.), please cite the specific call site and I will reopen.

/gemini review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread kv/fsm.go Outdated
Comment on lines +663 to +665
// deterministic ({txnRollbackVersion}) and a second Put against
// the already-tombstoned key would otherwise be rejected by the
// MVCC store as a write conflict (latestCommitTS > startTS).
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the rollback-marker key is “already-tombstoned”, but rollback markers are written via Put and are not tombstones. This wording is misleading; consider rephrasing to describe the real conflict cause (a later write with applyStartTS=startTS conflicts because the key’s latestCommitTS > startTS).

Suggested change
// deterministic ({txnRollbackVersion}) and a second Put against
// the already-tombstoned key would otherwise be rejected by the
// MVCC store as a write conflict (latestCommitTS > startTS).
// deterministic ({txnRollbackVersion}) and are written as normal
// values via Put; a second Put with applyStartTS=startTS would be
// rejected by the MVCC store as a write conflict because the key's
// latestCommitTS is already greater than startTS.

Copilot uses AI. Check for mistakes.
Comment thread kv/fsm_abort_test.go Outdated
// keys and written the rollback marker, a subsequent abort over the
// same mutation set must return nil without performing additional
// writes or store mutations (reads are allowed — the idempotent path
// still probes for the rollback marker and commit record via GetAt).
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test comment says the idempotent abort path probes for the rollback marker and commit record “via GetAt”, but the implementation uses ExistsAt. Consider updating the comment to match the code to avoid confusion during future maintenance/refactors.

Suggested change
// still probes for the rollback marker and commit record via GetAt).
// still probes for the rollback marker and commit record via ExistsAt).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances transaction consistency and idempotency within the KV FSM. Key changes include adding a guard in commitApplyStartTS to prevent committing aborted transactions, refactoring appendRollbackRecord to handle idempotent rollbacks, and updating shouldClearAbortKey to skip redundant cleanup for missing locks. Comprehensive tests were added to validate these behaviors. Feedback suggests optimizing appendRollbackRecord by short-circuiting the commit record check when a rollback marker is already detected to avoid unnecessary store lookups.

Comment thread kv/fsm.go
Comment on lines +666 to +686
markerPresent, err := f.store.ExistsAt(ctx, txnRollbackKey(primaryKey, startTS), ^uint64(0))
if err != nil {
return errors.WithStack(err)
}

// Verify the invariant regardless of marker presence: if a commit
// record is present for this (primaryKey, startTS), refuse to
// write (or confirm) a rollback marker. This catches out-of-order
// apply where a COMMIT somehow landed after a prior ABORT, as
// well as the normal "commit wins over rollback" race.
commitExists, err := f.store.ExistsAt(ctx, txnCommitKey(primaryKey, startTS), ^uint64(0))
if err != nil {
return errors.WithStack(err)
}
if commitExists {
return errors.WithStack(ErrTxnAlreadyCommitted)
}

if markerPresent {
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In appendRollbackRecord, if markerPresent is true, the code still performs a check for commitExists. While this is defensive and ensures the invariant (at most one of rollback marker or commit record exists), it adds an extra ExistsAt call on the idempotent path. Given that this path is identified as a hot path in production due to log spam, consider reordering the checks or short-circuiting if markerPresent is true, provided you are confident in the invariant's enforcement elsewhere.

References
  1. Short-circuit and return early when a state is already determined to avoid redundant or expensive operations, especially in hot paths.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 22, 2026

@gemini-code-assist Pushing back on the appendRollbackRecord medium (kv/fsm.go:686) — keeping the commitExists check on the idempotent path is deliberate. Brief history:

  1. That check was ADDED SPECIFICALLY in response to an earlier gemini HIGH review in this same PR (commit a1d8588f, "fix(kv): reject commit-after-abort and tighten rollback idempotency"). The earlier review argued: "the safety claim relies on handleCommitRequest performing a reciprocal check for the rollback marker... if a COMMIT request is applied after an ABORT has already written the rollback marker... handleCommitRequest would miss it." Reordering/short-circuiting now reverts that fix and loses defense-in-depth.

  2. Cost: it's a single ExistsAt (presence-only, no value clone) on what is already the uncommon path (lock-resolver race, retry after a previous abort). It is NOT the hot path — the prod log spam that motivated this PR came from appendRollbackRecord's first-time path tripping on a write conflict, and the idempotent fast-return still short-circuits the marker Put before any mutation is appended. Reordering saves nothing on first-time traffic and saves one ExistsAt on an already-cold retry path.

  3. The symmetric check in commitApplyStartTS (kv/fsm.go:466-482) is the primary defense against a commit-after-abort apply ordering, but the in-line verification here in appendRollbackRecord catches buggy-reorder / replay / dual-write bugs that could bypass handleCommitRequest entirely. TestFSMAbort_CommitAfterAbortIsRejected pins exactly this behavior — a commit landing after an abort must be rejected no matter which FSM entry point sees it first.

  4. Removing it would require trusting the invariant check in commitApplyStartTS to be the only line of defense — which is exactly what Copilot warned against in the earlier round: "adjust the safety comment to avoid asserting an invariant [relying on checks elsewhere]". The current code follows that guidance by verifying in-line on both the first-time and idempotent paths.

Given 1-4, I'd like you to explicitly acknowledge whether giving up defense-in-depth (and reverting the earlier HIGH fix from a1d8588f) is the intended tradeoff before I consider the reorder. If the answer is "yes, the commitApplyStartTS check is sufficient and the symmetric in-line check is redundant", I'll weigh that against TestFSMAbort_CommitAfterAbortIsRejected's contract; if not, the current structure stays.

/gemini review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Thank you for the detailed explanation and for providing the context regarding the serialization of raft.FSM.Apply and the specific requirements of the invariant enforcement. I acknowledge that the in-line ExistsAt checks in appendRollbackRecord serve as a critical defense-in-depth mechanism to maintain the {rollback marker, commit record} mutual exclusion invariant, especially given the potential for out-of-order apply or replay scenarios that could bypass other guards. Given the test coverage provided and the clear rationale for why these checks are necessary for correctness, I agree that the current implementation is appropriate and no further changes are required.

@bootjp bootjp merged commit 4426f58 into main Apr 22, 2026
7 of 8 checks passed
@bootjp bootjp deleted the fix/idempotent-rollback-marker branch April 22, 2026 13:08
bootjp added a commit that referenced this pull request Apr 23, 2026
## Motivation

After #581 we still see occasional secondary-write failures in
production with payloads like `!txn|rb|!redis|ttl|...` (OCC write
conflicts racing against rollback tombstones). Today the only signal is
the `secondary write failed` log line from `recordSecondaryWriteFailure`
— the unlabelled `proxy_secondary_write_errors_total` counter tells us
something is wrong but gives no hint whether the failure is an OCC
conflict, a retry-loop giving up, a lost leader, or a timeout. Ops have
been grepping logs to triage.

## Change

- Adds `proxy_secondary_write_errors_by_reason_total{cmd,reason}`
(CounterVec) alongside the existing unlabelled counter (kept for
dashboard backwards-compat).
- New helper `classifySecondaryWriteError(err) string` maps error
strings to one of:
- `retry_limit` — `"retry limit exceeded"` (checked first, since the
retry-limit message embeds `"write conflict"`)
  - `write_conflict` — raw OCC conflict, e.g. `!txn|rb|` races
- `deadline_exceeded` — `errors.Is(err, context.DeadlineExceeded)` or
substring
  - `not_leader` — leadership lost mid-dispatch
- `txn_already_finalized` — `"txn already committed"` / `"txn already
aborted"`
  - `other` — everything else
- `recordSecondaryWriteFailure` increments both counters.
- Tests: table test for the classifier + a test asserting both counters
tick.

## Suggested Grafana query

```promql
sum by (cmd, reason) (rate(proxy_secondary_write_errors_by_reason_total[5m]))
```

A per-reason stacked graph makes the `!txn|rb|` conflict pattern (the
motivating incident) show up as a spike in `reason="write_conflict"`
without needing to touch logs.

## Test plan

- [x] `go test -race -count=1 ./proxy/...`
- [x] `make lint`


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Secondary write error monitoring now automatically classifies failures
by specific error type to improve diagnostics and observability.
* New detailed metrics track write failure patterns across different
error categories, enabling better visibility and root cause analysis.

* **Tests**
* Added comprehensive unit tests validating error classification logic
and metrics recording functionality across various failure scenarios.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants